• no connection (zero)
• skip connection (identity)
• 3 × 3 dilated convolution with rate 2
• 5 × 5 dilated convolution with rate 2
• 3 × 3 max pooling
• 3 × 3 average pooling
• 3 × 3 depth-wise separable convolution
• 5 × 5 depth-wise separable convolution
We replace the depth-wise separable convolutions with binarized counterparts, i.e., with binarized weights and activations. The skip connection is an identity mapping in NAS, rather than an additional shortcut. Optimizing BNNs is more challenging than optimizing conventional CNNs [77, 199], since binarization adds an extra burden to NAS. Following [151], to reduce undesirable fluctuation in the performance evaluation, we normalize the architecture parameters of the M operations on each edge to obtain the final architecture indicator as
\[
\hat{o}_{m}^{(i,j)}(a^{(j)}) = \frac{\exp\{\alpha_{m}^{(i,j)}\}}{\sum_{m'} \exp\{\alpha_{m'}^{(i,j)}\}}\, o_{m}^{(i,j)}(a^{(j)}),
\tag{4.28}
\]
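Below is a minimal PyTorch-style sketch of such a softmax-normalized mixed operation on a single edge, in the spirit of differentiable NAS; the class and attribute names (`MixedOp`, `ops`, `alpha`) are illustrative and not taken from the DCP-NAS code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge (i, j): softmax-normalized combination of M candidate operations (Eq. 4.28)."""

    def __init__(self, candidate_ops):
        super().__init__()
        # candidate_ops: list of nn.Module, one per operation in the search space above
        self.ops = nn.ModuleList(candidate_ops)
        # architecture parameters alpha_m^{(i,j)}, one scalar per candidate operation
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, x):
        # normalize the M architecture parameters of this edge:
        # exp(alpha_m) / sum_{m'} exp(alpha_{m'})
        weights = F.softmax(self.alpha, dim=0)
        # weighted sum of the candidate operations o_m^{(i,j)}(a^{(j)})
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```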
4.4.4 Tangent Propagation for DCP-NAS
In this section, we first propose generating the tangent direction based on the Parent model and then present tangent propagation to search for the optimized architecture in binary NAS effectively. As shown in Fig. 4.12, the novelty of DCP-NAS lies in introducing tangent propagation and decoupled optimization, leading to a practical discrepancy-based search framework. The main motivation of DCP-NAS is to “fine-tune” the Child model architecture based on the real-valued Parent rather than directly binarizing the Parent. Thus, we first take advantage of the Parent model to generate the tangent direction from its architecture gradient as
\[
\frac{\partial \tilde{f}^{P}(w, \alpha)}{\partial \alpha} = \sum_{n=1}^{N} \frac{\partial p_{n}(w, \alpha)}{\partial \alpha},
\tag{4.29}
\]
where $\tilde{f}(w, \alpha)$ is defined in Eq. 4.21.
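As a rough sketch, assuming the Parent is a differentiable real-valued supernet whose architecture parameters are exposed as a single tensor `alpha` and whose forward pass returns the per-sample outputs $p_n(w, \alpha)$, the tangent direction can be obtained with one autograd call (the helper name `tangent_direction` is ours, not the official API):

```python
import torch

def tangent_direction(parent_model, alpha, inputs):
    """Tangent direction of Eq. 4.29: the gradient of sum_n p_n(w, alpha) w.r.t. alpha."""
    outputs = parent_model(inputs, alpha)      # p_n(w, alpha) for the N samples in the batch
    objective = outputs.sum()                  # sum_n p_n(w, alpha)
    # d(sum_n p_n)/d alpha = sum_n d p_n / d alpha
    (grad_alpha,) = torch.autograd.grad(objective, alpha)
    return grad_alpha.detach()
```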
Then we conduct the second step, i.e., tangent propagation for the Child model. In each epoch of binary NAS in our DCP-NAS, we inherit the architecture parameters from the real-valued model, i.e., $\hat{\alpha} \leftarrow \alpha$, and enforce the binary network to learn a distribution similar to that of the real-valued network:
\[
\max_{\hat{w}\in\mathcal{W},\,\hat{\alpha}\in\mathcal{A},\,\beta\in\mathbb{R}^{+}} G(\hat{w}, \hat{\alpha}, \beta)
= \tilde{f}^{P}(w, \alpha)\log\frac{\tilde{f}^{P}(w, \alpha)}{\tilde{f}^{C}_{b}(\hat{w}, \hat{\alpha}, \beta)}
= \sum_{n=1}^{N} p_{n}(w, \alpha)\log\frac{\hat{p}_{n}(\hat{w}, \hat{\alpha}, \beta)}{p_{n}(w, \alpha)},
\tag{4.30}
\]
where the KL divergence is used to supervise the binary search process. $G(\hat{w}, \hat{\alpha}, \beta)$ measures the similarity between the output logits of the real-valued network $p(\cdot)$ and those of the binary network $\hat{p}(\cdot)$, where the teacher's output is already given.
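A hedged PyTorch sketch of this objective is given below, assuming `parent_logits` and `child_logits` are the raw outputs of the real-valued Parent and the 1-bit Child on the same batch; the function name `child_parent_kl` is illustrative. It realizes the right-hand side of Eq. 4.30, $\sum_n p_n \log(\hat{p}_n / p_n)$, which is maximized (equivalently, its negation is minimized) during the binary search:

```python
import torch.nn.functional as F

def child_parent_kl(parent_logits, child_logits):
    """G of Eq. 4.30: sum_n p_n log(p_hat_n / p_n), averaged over the batch."""
    p = F.softmax(parent_logits.detach(), dim=-1)          # teacher distribution p_n (fixed)
    log_p = F.log_softmax(parent_logits.detach(), dim=-1)
    log_p_hat = F.log_softmax(child_logits, dim=-1)        # binary Child distribution p_hat_n
    # sum_n p_n * (log p_hat_n - log p_n) = -KL(p || p_hat); maximizing it aligns Child with Parent
    return (p * (log_p_hat - log_p)).sum(dim=-1).mean()
```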
To further optimize the binary architecture, we constrain the gradient of binary NAS
using the tangent direction as
\[
\min_{\hat{\alpha}\in\mathcal{A}} D(\hat{\alpha})
= \left\| \frac{\partial \tilde{f}^{P}(w, \alpha)}{\partial \alpha} - \frac{\partial G(\hat{w}, \hat{\alpha}, \beta)}{\partial \hat{\alpha}} \right\|_{2}^{2}.
\tag{4.31}
\]
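The sketch below combines the two previous helpers, assuming `tangent` is the Parent gradient from Eq. 4.29 (kept fixed) and `g_objective` is the scalar value of $G(\hat{w}, \hat{\alpha}, \beta)$ computed on the binary Child with architecture parameters `alpha_hat`; as before, the names are illustrative only.

```python
import torch

def tangent_constraint(g_objective, alpha_hat, tangent):
    """D(alpha_hat) of Eq. 4.31: squared L2 distance between the binary
    architecture gradient dG/d alpha_hat and the Parent tangent direction."""
    # create_graph=True keeps the graph so D(alpha_hat) can itself be minimized w.r.t. alpha_hat
    (grad_alpha_hat,) = torch.autograd.grad(g_objective, alpha_hat, create_graph=True)
    return (grad_alpha_hat - tangent.detach()).pow(2).sum()
```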